Auto-tuning a Matrix Routine for High Performance

Authors

  • Rune E. Jensen
  • Ian Karlin
  • Anne C. Elster
Abstract

Well-written scientific simulations typically get tremendous performance gains by using highly optimized library routines. Some of the most fundamental of these routines perform matrix-matrix multiplication and related operations, and are collectively known as BLAS (Basic Linear Algebra Subprograms). Optimizing these library routines for efficiency is therefore of tremendous importance to many scientific simulations; in fact, some of them are often hand-optimized in assembly language for a given processor in order to get the best possible performance. In this paper, we present a new tuning approach that combines a small snippet of assembly code with an auto-tuner. For our preliminary test case, the symmetric rank-2 update, the resulting routine outperforms the best auto-tuned and vendor-supplied code on our target machine, an Intel quad-core processor, and performs within 1.2% of the best hand-coded library. Our novel approach shows a lot of promise for further performance gains on modern multi-core and many-core processors.
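For reference, the paper's test case, the symmetric rank-2 update (BLAS SYR2), computes A ← α(xyᵀ + yxᵀ) + A. The sketch below is a plain, unoptimized reference version of that operation for illustration only; it is not the tuned assembly/auto-tuned routine the paper describes:

```python
import numpy as np

def syr2(alpha, x, y, A):
    """Symmetric rank-2 update: A <- alpha*(x y^T + y x^T) + A.

    Reference implementation; real BLAS kernels exploit the symmetry
    of the result and block for cache, which this sketch does not.
    """
    return A + alpha * (np.outer(x, y) + np.outer(y, x))
```

Because xyᵀ + yxᵀ is symmetric, the update preserves the symmetry of A, which is what optimized implementations exploit by touching only one triangle.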


Related articles

A Methodology for Automatically Tuned Parallel Tridiagonalization on Distributed Memory Vector-parallel Machines

In this paper, we describe an auto-tuning methodology for parallel tridiagonalization to attain high performance. By searching for the optimal set of three performance parameters, a highly efficient routine can be obtained automatically. The methodology is evaluated on distributed-memory parallel machines, the HITACHI SR2201 and HITACHI SR8000. The experimental res...
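The search step such auto-tuners rely on can be sketched generically: time each point of a small parameter grid and keep the fastest. The parameter names below are illustrative placeholders, not the three parameters of the paper:

```python
import itertools
import time

def autotune(kernel, param_space, *args):
    """Exhaustively time `kernel` over a grid of tuning parameters
    and return (best_config, best_time).

    `param_space` maps parameter names to candidate value lists; this
    is the brute-force search core of many auto-tuners (real ones add
    repetitions, warm-up runs, and pruning).
    """
    best_config, best_time = None, float("inf")
    for values in itertools.product(*param_space.values()):
        config = dict(zip(param_space.keys(), values))
        start = time.perf_counter()
        kernel(*args, **config)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time
```

In practice each candidate is timed several times and the minimum or median is kept, since single timings are noisy.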


A Note on Auto-tuning GEMM for GPUs

The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in...
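GEMM computes C ← αAB + βC. A naive triple-loop version, far from the tiled and vectorized kernels benchmarked in that note, shows the arithmetic being tuned:

```python
def gemm(alpha, A, B, beta, C):
    """Naive GEMM on lists of lists: C <- alpha*A@B + beta*C.

    Optimized GEMMs restructure this triple loop with blocking,
    vectorization, and (on GPUs) shared-memory tiling; this sketch
    only defines the operation.
    """
    n, k, m = len(A), len(B), len(B[0])
    out = [[beta * C[i][j] for j in range(m)] for i in range(n)]
    for i in range(n):
        for p in range(k):
            a = alpha * A[i][p]
            for j in range(m):
                out[i][j] += a * B[p][j]
    return out
```

The loop order above (i, p, j) keeps the innermost accesses to B and `out` contiguous, one of the many choices auto-tuners explore.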


Offline Auto-Tuning of a PID Controller Using Extended Classifier System (XCS) Algorithm

Proportional + Integral + Derivative (PID) controllers are so widely used in engineering applications that more than half of all industrial controllers are PID controllers. There are many methods in the literature for tuning the PID parameters. In this paper, an intelligent technique based on the eXtended Classifier System (XCS) is presented to tune the PID controller parameters. The PID controlle...
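A minimal discrete PID step clarifies what the three gains (Kp, Ki, Kd) that a tuning method such as XCS searches for actually control. This is the generic textbook form, not the controller from the paper:

```python
class PID:
    """Discrete PID controller: u = Kp*e + Ki*integral(e) + Kd*de/dt.

    The gains kp, ki, kd are exactly the parameters a tuning method
    (Ziegler-Nichols, XCS, ...) must choose.
    """
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt           # accumulate I term
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)
```

Roughly, Kp reacts to the current error, Ki removes steady-state offset, and Kd damps fast changes; a tuner trades these off against overshoot and settling time.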


Auto-tuning of level 1 and level 2 BLAS for GPUs

The use of high performance libraries for dense linear algebra operations is of great importance in many numerical scientific applications. The most common operations form the backbone of the Basic Linear Algebra Subroutines (BLAS) library. In this paper, we consider the performance and auto-tuning of level 1 and level 2 BLAS routines on GPUs. As examples, we develop single-precision CUDA kerne...
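The distinction between the two BLAS levels mentioned here: level 1 operates on vectors (O(n) work on O(n) data, e.g. AXPY), while level 2 pairs a matrix with a vector (O(n²) work, e.g. GEMV). Unoptimized illustrative versions:

```python
import numpy as np

def axpy(alpha, x, y):
    """Level-1 BLAS AXPY: y <- alpha*x + y."""
    return alpha * x + y

def gemv(alpha, A, x, beta, y):
    """Level-2 BLAS GEMV: y <- alpha*A@x + beta*y."""
    return alpha * (A @ x) + beta * y
```

Because both levels do O(1) arithmetic per element loaded, they are memory-bandwidth bound, which is why GPU tuning for them centers on memory access patterns rather than flops.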


Auto-tuning the Matrix Powers Kernel with SEJITS

The matrix powers kernel, used in communication-avoiding Krylov subspace methods, requires runtime auto-tuning for best performance. We demonstrate how the SEJITS (Selective Embedded Just-In-Time Specialization) approach can be used to deliver a high-performance and performance-portable implementation of the matrix powers kernel to application authors, while separating their high-level concerns ...
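The matrix powers kernel referred to above computes the Krylov basis [x, Ax, A²x, ..., Aᵏx]. A straightforward, communication-oblivious version defines the kernel's output:

```python
import numpy as np

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, A@x, ..., A^k @ x].

    Communication-avoiding variants reorganize this loop to compute
    several powers per pass over A; this sketch only specifies what
    the kernel produces.
    """
    basis = [x]
    for _ in range(k):
        basis.append(A @ basis[-1])
    return basis
```

The naive loop reads A once per power; the communication-avoiding reformulation that motivates the auto-tuning amortizes those reads across all k powers.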




Journal:

Publication year: 2011